Exploring the SAWA corpus: collection and deployment of a parallel corpus English - Swahili
Authors
G. De Pauw, P. W. Wagacha, G.-M. de Schryver

Abstract
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English-Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically sentence- and word-aligned, and morphosyntactic information was added to both the English and Swahili portions of the corpus. The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system for this language pair, using the parallel corpus and a consolidated database of existing English-Swahili translation dictionaries. We particularly focus on the difficulties ...

Acknowledgements: The research presented in this paper was made possible through the support of the VLIR-IUC-UON program and was partly funded by the SAWA BOF UA-2007 project. The first author is funded as a Postdoctoral Fellow of the Research Foundation - Flanders (FWO).

Author affiliations:
G. De Pauw: CLiPS, Department of Linguistics, University of Antwerp, Antwerp, Belgium; School of Computing and Informatics, University of Nairobi, Nairobi, Kenya
P. W. Wagacha: School of Computing and Informatics, University of Nairobi, Nairobi, Kenya
G.-M. de Schryver: Department of African Languages and Cultures, Ghent University, Ghent, Belgium; Xhosa Department, University of the Western Cape, Cape Town, South Africa

Citation: Lang Resources & Evaluation (2011) 45:331–344, DOI 10.1007/s10579-011-9159-7
Similar resources
Towards English - Swahili Machine Translation
Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language technological point of view, an unfortunate situation that holds for most, if not all languages on the continent. The increasing amount of digitally available, vernacular data has prompted researchers to investigate the applicabilit...
The SAWA Corpus: A Parallel Corpus English - Swahili
Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To investigate this, a parallel corpus is typically aligned by linking linguistic tokens in the sour...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: can its results be extended to other languages? Since the goal of document clustering is grouping of docum...
Strategies Used in the Translation of Interlingual Subtitling
This study was an attempt to identify the interlingual strategies employed to translate English subtitles into Persian and to determine their frequency. Unlike in many countries, subtitling is a new field in Iran. The study, a corpus-based, comparative, descriptive, non-judgmental analysis of an English-Persian parallel corpus, comprised English audio scripts of five movies of differ...
Vocabulary Lists for EAP and Conversation Students
Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...
Journal title: Language Resources and Evaluation
Volume 45
Pages 331–344
Publication date 2011